Abstract. The UCT algorithm, which combines the UCB algorithm and
Monte-Carlo Tree Search (MCTS), is currently the most widely used variant
of MCTS. Recently, a number of investigations into applying other
bandit algorithms to MCTS have produced interesting results. In this
research, we investigate the possibility of combining the improved
UCB algorithm, proposed by Auer et al. [2], with MCTS. However, various
characteristics and properties of the improved UCB algorithm may
not be ideal for a direct application to MCTS. Therefore, some modifications
were made to the improved UCB algorithm, making it more
suitable for the task of game tree search. The Mi-UCT algorithm is the
application of the modified UCB algorithm to trees. The performance
of Mi-UCT is demonstrated on the games of 9 x 9 Go and 9 x 9
NoGo, where it is shown to outperform the plain UCT algorithm when only
a small number of playouts are given, and to perform roughly on the same
level when more playouts are available.


1 Introduction 

The development of Monte-Carlo Tree Search (MCTS) has made a significant
impact on various fields of computer game play, especially the field of computer
Go [?]. The UCT algorithm [3] is an MCTS algorithm that combines the UCB
algorithm [4] and MCTS, by treating each node as a single instance of the
multi-armed bandit problem. The UCT algorithm is one of the most prominent
variants of Monte-Carlo Tree Search [8].

Recently, various investigations have been carried out on exploring the
possibility of applying other bandit algorithms to MCTS. The application of simple
regret minimizing bandit algorithms has shown the potential to overcome some
weaknesses of the UCT algorithm [7, ?]. The sequential halving on trees (SHOT)
algorithm [?] applies the sequential halving algorithm [11] to MCTS. The SHOT
algorithm has various advantages over the UCT algorithm, and has demonstrated
better performance on the game of NoGo. The H-MCTS algorithm [?] performs
selection by the SHOT algorithm for nodes that are near the root and by the UCT
algorithm for deeper nodes. H-MCTS has also shown superiority over UCT
in games such as 8 x 8 Amazons and 8 x 8 AtariGo. Applications of KL-UCB [?]
and Thompson sampling [13] to MCTS have also been investigated
and produced some interesting results [10].

The improved UCB algorithm [2] is a modification of the UCB algorithm, and
it has been shown that the improved UCB algorithm has a tighter regret upper
bound than the UCB algorithm. In this research, we explore the possibility of
applying the improved UCB algorithm to MCTS. However, some characteristics
of the improved UCB algorithm may not be desirable for a direct application
to MCTS. Therefore, we have made some modifications to the improved UCB
algorithm, making it more suitable for the task of game tree search. We
demonstrate the impact and implications of these modifications
in an empirical study under the conventional
multi-armed bandit problem setting. We then introduce the Mi-UCT algorithm,
which is the application of the modified improved UCB algorithm to MCTS, and
demonstrate its performance on the games of 9 x 9 Go and 9 x 9 NoGo.
Mi-UCT is shown to outperform the plain UCT algorithm when given
a small number of playouts, and to perform roughly on the same level when
more playouts are given.


Algorithm 1 The Improved UCB Algorithm [2]

Input: A set of arms $A$, total number of trials $T$
Initialization: Expected regret $\Delta_0 \leftarrow 1$, a set of candidate arms $B_0 \leftarrow A$
for rounds $m = 0, 1, \ldots, \lfloor \frac{1}{2} \log_2 \frac{T}{e} \rfloor$ do
    (1) Arm Selection:
    for all arms $a_i \in B_m$ do
        for $n_m = \lceil 2 \log(T \Delta_m^2) / \Delta_m^2 \rceil$ times do
            sample the arm $a_i$ and update its average reward $w_i$
        end for
    end for
    (2) Arm Elimination:
    $a_{\max} \leftarrow$ MaximumRewardArm($B_m$)
    for all arms $a_i \in B_m$ do
        if $w_i + \sqrt{\log(T \Delta_m^2) / (2 n_m)} < w_{\max} - \sqrt{\log(T \Delta_m^2) / (2 n_m)}$ then
            remove $a_i$ from $B_m$
        end if
    end for
    (3) Update $\Delta_m$:
    $\Delta_{m+1} = \Delta_m / 2$
end for

2 Applying Modified Improved UCB Algorithm to Trees

In this section we will first introduce the improved UCB algorithm. We will then 
proceed to make some modifications to the improved UCB algorithm, and finally 
show how to apply the modified algorithm to Monte-Carlo Tree Search. 


2.1 Improved UCB Algorithm 

In the multi-armed bandit problem (MAB), a player is faced with a $K$-armed
bandit, and the player can decide to pull one of the arms at each play. The bandit
will produce a reward $r \in [0, 1]$ according to the arm that has been pulled. The
distribution of the reward of each arm is unknown to the player. The objective
of the player is to maximize the total amount of reward over $T$ plays. Bandit
algorithms are policies that the player can follow to achieve this goal. Equivalent
to maximizing the total expected reward, bandit algorithms aim to minimize the
cumulative regret, which is defined as

$$R_T = \sum_{t=1}^{T} \left( r^* - r_{I_t} \right),$$

where $r^*$ is the expected mean reward of the optimal arm, and $r_{I_t}$ is the received
reward when the player chooses to play arm $I_t \in A$ at play $t = 1, \ldots, T$. If a bandit
algorithm can restrict the cumulative regret to the order of $O(\log T)$, it is said
to be optimal [1]. The UCB algorithm [3], which is used in the UCT algorithm
[?], is an optimal algorithm which restricts the cumulative regret to
$O\left(\frac{K \log T}{\Delta}\right)$,
where $\Delta$ is the difference of expected reward between a suboptimal arm and the
optimal arm. The improved UCB algorithm [2] is a modification of the UCB
algorithm, and it can further restrict the growth of the cumulative regret to the
order of $O\left(\frac{K \log(T \Delta^2)}{\Delta}\right)$.

The improved UCB algorithm, shown in Algorithm 1, essentially maintains
a candidate set $B_m$ of potential optimal arms, and then proceeds to
systematically eliminate arms which are estimated to be suboptimal from that set. A
predetermined number of total plays $T$ is given to the algorithm, and the plays
are further divided into $\lfloor \frac{1}{2} \log_2 \frac{T}{e} \rfloor$ rounds. Each round consists of three major
steps. In the first step, the algorithm samples each arm that is in the candidate
set $n_m = \lceil 2 \log(T \Delta_m^2) / \Delta_m^2 \rceil$ times. Next, the algorithm proceeds to remove the arms
whose upper bounds of estimated expected reward are less than the lower bound
of the current best arm. The estimated difference $\Delta_m$ is then halved in the final
step. After each round, the expected reward of the arm is effectively estimated
as

$$w_i \pm \sqrt{\frac{\log(T \Delta_m^2)}{2 n_m}} = w_i \pm \sqrt{\frac{\log(T \Delta_m^2) \cdot \Delta_m^2}{4 \log(T \Delta_m^2)}} = w_i \pm \frac{\Delta_m}{2},$$

where $w_i$ is the current average reward received from arm $a_i$.
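
The three steps of a round can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `improved_ucb` and the representation of arms as zero-argument callables are assumptions made here, and the selection step tops each candidate up to $n_m$ pulls in total, which is one reading of "samples each arm $n_m$ times".

```python
import math

def improved_ucb(arms, T):
    """Sketch of the improved UCB rounds for at most T plays.

    `arms` is a list of zero-argument callables returning rewards in [0, 1]
    (a hypothetical interface chosen for this example).
    Returns (surviving candidate indices, plays used, average rewards).
    """
    K = len(arms)
    counts, means = [0] * K, [0.0] * K
    candidates = list(range(K))
    delta, plays = 1.0, 0
    last_round = math.floor(0.5 * math.log2(T / math.e))
    for m in range(last_round + 1):
        if len(candidates) == 1 or plays >= T:
            break
        log_term = math.log(T * delta ** 2)
        n_m = math.ceil(2 * log_term / delta ** 2)
        # (1) Arm selection: bring every candidate up to n_m pulls in total.
        for i in candidates:
            while counts[i] < n_m and plays < T:
                reward = arms[i]()
                counts[i] += 1
                means[i] += (reward - means[i]) / counts[i]
                plays += 1
        if plays >= T:
            break
        # (2) Arm elimination: drop arms whose upper bound lies below
        # the lower bound of the current best arm.
        radius = math.sqrt(log_term / (2 * n_m))
        best_lower = max(means[i] for i in candidates) - radius
        candidates = [i for i in candidates if means[i] + radius >= best_lower]
        # (3) Halve the estimated difference.
        delta /= 2
    return candidates, plays, means
```

With three constant-reward arms, for example `improved_ucb([lambda: 0.9, lambda: 0.1, lambda: 0.5], 1000)`, the two weaker arms are eliminated within the first three rounds and only the first arm remains in the candidate set.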

In the case when the total number of plays $T$ is not predetermined, the
improved UCB algorithm can be run in an episodic manner; a total of $T_0 = 2$
plays is given to the algorithm in the initial episode, and the number of plays of
subsequent episodes is given by $T_{\ell+1} = T_\ell^2$.
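
To give a feel for how quickly the episode lengths grow, the following sketch lists the episode budgets needed to cover a given number of plays. The helper name `episode_horizons` is hypothetical and serves only as an illustration of the squaring schedule.

```python
def episode_horizons(total_plays, T0=2):
    """Return the episode budgets T_0, T_1, ... needed to cover total_plays,
    using the squaring schedule T_{l+1} = T_l ** 2."""
    horizons, used, T = [], 0, T0
    while used < total_plays:
        horizons.append(T)
        used += T
        T = T * T
    return horizons

print(episode_horizons(1000))  # [2, 4, 16, 256, 65536]
```

Only five episodes are needed to cover 1000 plays, and the fifth episode alone is far longer than the remaining budget, so in practice the last episode is usually cut short.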

Algorithm 2 Modified Improved UCB Algorithm

Input: A set of arms $A$, total number of trials $T$
Initialization: Expected regret $\Delta_0 \leftarrow 1$, arm count $N_m \leftarrow |A|$, plays till $\Delta_k$ update
    $T_{\Delta_0} \leftarrow n_0 \cdot N_m$, where $n_0 \leftarrow \lceil 2 \log(T \Delta_0^2) / \Delta_0^2 \rceil$, number of times arm $a_i \in A$ has been
    sampled $t_i \leftarrow 0$.
for rounds $m = 0, 1, \ldots, T$ do
    (1) Sample Best Arm:
    $a_{\max} \leftarrow \arg\max_{i \in |A|} \left( w_i + \sqrt{\log(T \Delta_k^2) \cdot r_i / (2 n_k)} \right)$, where $r_i = T / t_i$
    $w_{\max} \leftarrow$ UpdateMaxWinRate($A$)
    $t_i \leftarrow t_i + 1$
    (2) Arm Count Update:
    for all arms $a_i$ do
        if $w_i + \sqrt{\log(T \Delta_k^2) / (2 n_k)} < w_{\max} - \sqrt{\log(T \Delta_k^2) / (2 n_k)}$ then
            $N_m \leftarrow N_m - 1$
        end if
    end for
    (3) Update $\Delta_k$ when Deadline $T_{\Delta_k}$ is Reached
    if $m \ge T_{\Delta_k}$ then
        $\Delta_{k+1} = \Delta_k / 2$
        $n_{k+1} \leftarrow \lceil 2 \log(T \Delta_{k+1}^2) / \Delta_{k+1}^2 \rceil$
        $T_{\Delta_{k+1}} \leftarrow m + (n_{k+1} \cdot N_m)$
        $k \leftarrow k + 1$
    end if
end for

2.2 Modification of the Improved UCB Algorithm

Various characteristics of the improved UCB algorithm might be problematic 
for its application to MCTS: 

- Early explorations. The improved UCB algorithm tries to find the optimal
  arm by the process of elimination. Therefore, in order to eliminate
  suboptimal arms as early as possible, it has the tendency to devote more
  plays to suboptimal arms in the early stages. This might not be ideal when
  it comes to MCTS, especially in situations when time and resources are
  rather restricted, because it may end up spending most of the time exploring
  irrelevant parts of the game tree, rather than searching deeper into more
  promising subtrees.

- Not an anytime algorithm. The improved UCB algorithm requires the
  total number of plays to be specified beforehand, and its major properties or
  theoretical guarantees may not hold if it is stopped prematurely. Since we are
  considering each node as a single instance of the MAB problem in MCTS,
  internal nodes which are deeper in the tree are most likely the instances that
  are prematurely stopped. The "temporary" solutions provided by these nodes
  might be erroneous, and the effect of these errors may be magnified as they
  propagate upward to the root node. On the other hand, it would be rather
  expensive to ensure that the required conditions are met for the improved UCB
  algorithm on each node, because the necessary number of playouts grows
  exponentially as the number of expanded nodes increases.

Therefore, we have made some adjustments to the improved UCB algorithm 
before applying it to MCTS. 

The modified improved UCB bandit algorithm is shown in Algorithm 2. The
modifications try to retain the major characteristics of the improved UCB
algorithm, especially the way the confidence bounds are updated and maintained.
Nonetheless, we should note that these modifications will change the algorithm's
behaviour, and the theoretical guarantees of the original algorithm may no longer
be applicable.


Algorithmic Modifications. We have made two major adjustments to the
algorithmic aspect of the improved UCB algorithm:

1. Greedy optimistic sampling. We only sample the arm that currently has
the highest upper bound, rather than sampling every possible arm $n_m$ times.

2. Maintain candidate arm count. We only maintain the count of potential
optimal arms, instead of maintaining a candidate set.

Since we are only sampling the current best arm, we are effectively performing
a more aggressive arm elimination: arms that are perceived to be suboptimal
are not sampled. Therefore, there is no longer a need to maintain a
candidate set.

Algorithm 3 Modified Improved UCB Algorithm applied to Trees (Mi-UCT)

function Mi-UCT(Node $N$)
    bestucb $\leftarrow -\infty$
    for all child nodes $n_i$ of $N$ do
        if $n_i.t = 0$ then
            $n_i.ucb \leftarrow \infty$
        else
            $r_i \leftarrow N.episodeUpdate / n_i.t$
            $n_i.ucb \leftarrow n_i.w + \sqrt{\log(N.T \times N.\Delta^2) \times r_i / (2 N.k)}$
        end if
        if bestucb $< n_i.ucb$ then
            bestucb $\leftarrow n_i.ucb$
            $n_{best} \leftarrow n_i$
        end if
    end for
    if $n_{best}.times = 0$ then
        result $\leftarrow$ RandomSimulation($n_{best}$)
    else
        if $n_{best}$ is not yet expanded then NodeExpansion($n_{best}$)
        result $\leftarrow$ Mi-UCT($n_{best}$)
    end if
    $N.w \leftarrow (N.w \times N.t + result) / (N.t + 1)$
    $N.t \leftarrow N.t + 1$
    if $N.t \ge N.T$ then
        $N.\Delta \leftarrow 1$
        $N.T \leftarrow N.t + N.T \times N.T$
        $N.armCount \leftarrow$ Total number of child nodes
        $N.k \leftarrow \lceil 2 \log(N.T \times N.\Delta^2) / N.\Delta^2 \rceil$
        $N.deltaUpdate \leftarrow N.t + N.k \times N.armCount$
    end if
    if $N.t \ge N.deltaUpdate$ then
        for all child nodes $n_i$ of $N$ do
            if $n_i.w + \sqrt{\log(N.T \times N.\Delta^2) / (2 N.k)} < N.w - \sqrt{\log(N.T \times N.\Delta^2) / (2 N.k)}$ then
                $N.armCount \leftarrow N.armCount - 1$
            end if
        end for
        $N.\Delta \leftarrow N.\Delta / 2$
        $N.k \leftarrow \lceil 2 \log(N.T \times N.\Delta^2) / N.\Delta^2 \rceil$
        $N.deltaUpdate \leftarrow N.t + N.k \times N.armCount$
    end if
    return result
end function

function NodeExpansion(Node $N$)
    $N.\Delta \leftarrow 1$
    $N.T \leftarrow 2$
    $N.armCount \leftarrow$ Total number of child nodes
    $N.k \leftarrow \lceil 2 \log(N.T \times N.\Delta^2) / N.\Delta^2 \rceil$
    $N.deltaUpdate \leftarrow N.k \times N.armCount$
end function

However, the confidence bound in the improved UCB algorithm for arm $a_i$
is defined as $w_i \pm \sqrt{\log(T \Delta_m^2) / (2 n_m)}$, and the updates of $\Delta_m$ and $n_m$ are both dictated
by the number of plays in each round, which is determined by $|B_m| \cdot n_m$, i.e.,
the total number of plays that is needed to sample each arm in the candidate
set $B_m$ for $n_m$ times. Therefore, in order to update the confidence bound we
need to maintain the count of potential optimal arms.

The implication of sampling only the current best arm is that the estimated
bound $w_i \pm \Delta_m$ is more likely to hold than in the improved UCB
algorithm, because the current best arm will likely be sampled at least $n_m$
times. This is desirable in game tree search, since it is more efficient to
verify that a variation is indeed the principal variation than to identify and
verify that the others are suboptimal.


Confidence Bound Modification. Since we have modified the algorithm to
sample only the current best arm, the confidence bound for the current best
arm should be tighter than that of the other arms. Hence, an adjustment to the
confidence bound is also needed.

In order to reflect the fact that the current best arm is sampled more than
other arms, we have modified the definition of the confidence bound for arm $a_i$
to

$$w_i \pm \sqrt{\frac{\log(T \Delta_m^2) \cdot r_i}{2 n_m}},$$

where the factor $r_i = T / t_i$, and $t_i$ is the number of times that the arm has been
sampled. The more arm $a_i$ is sampled, the smaller $r_i$ will be, and hence the
tighter the confidence bound. Therefore, the expected reward of arm $a_i$ will
be estimated as

$$w_i \pm \sqrt{\frac{\log(T \Delta_m^2) \cdot r_i}{2 n_m}} = w_i \pm \sqrt{\frac{\log(T \Delta_m^2) \cdot \Delta_m^2 \cdot r_i}{4 \log(T \Delta_m^2)}} = w_i \pm \frac{\Delta_m}{2} \sqrt{r_i}.$$
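
The simplification above can be checked numerically. The sketch below compares the modified confidence radius, computed with the rounded-up $n_m$, against the closed form $\frac{\Delta_m}{2}\sqrt{r_i}$; the function names are hypothetical and chosen only for this example. Because $n_m$ is rounded up, the computed radius is never larger than the closed form.

```python
import math

def modified_radius(T, delta, t_i):
    """Confidence radius sqrt(log(T delta^2) * r_i / (2 n_m)) with r_i = T / t_i."""
    log_term = math.log(T * delta ** 2)
    n_m = math.ceil(2 * log_term / delta ** 2)
    r_i = T / t_i
    return math.sqrt(log_term * r_i / (2 * n_m))

def closed_form_radius(T, delta, t_i):
    """Closed form (delta / 2) * sqrt(r_i), exact when n_m needs no rounding."""
    return delta / 2 * math.sqrt(T / t_i)

# The two agree up to the rounding of n_m, and the radius shrinks as t_i grows.
print(modified_radius(1000, 0.5, 10), closed_form_radius(1000, 0.5, 10))
print(modified_radius(1000, 0.5, 100), closed_form_radius(1000, 0.5, 100))
```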



Since it would be more desirable that the total number of plays is not required
beforehand, we run the modified improved UCB algorithm in an episodic
fashion when we apply it to MCTS, i.e., assigning a total of $T_0 = 2$ plays to the
algorithm in the initial episode, and $T_{\ell+1} = T_\ell^2$ plays in the subsequent episodes.
After each episode, all the relevant terms in the confidence bound, such as $\Delta_m$
and $n_m$, are re-initialized. If the average rewards were re-initialized as well,
information from previous episodes would be lost. Therefore, in order to "share"
information across episodes, we do not re-initialize the average reward $w_i$
after each episode.
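
The modified bandit algorithm can be sketched as below for a single, non-episodic run. This is an illustration under stated assumptions rather than the authors' code: the name `modified_improved_ucb` and the callable-arm interface are hypothetical, the candidate count is recomputed at each deadline instead of being decremented (so that no arm is counted twice), it is kept at a minimum of one, and halving of $\Delta$ stops once $T \Delta^2$ would fall below $e$, mirroring the round limit of the improved UCB algorithm.

```python
import math

def modified_improved_ucb(arms, T):
    """Sketch of greedy sampling with the r_i-scaled confidence bound.

    `arms` is a list of zero-argument callables (hypothetical interface).
    Returns (pull counts, average rewards) after T plays; assumes T >= 3.
    """
    K = len(arms)
    counts, means = [0] * K, [0.0] * K
    delta, arm_count = 1.0, K
    log_term = math.log(T * delta ** 2)
    n_k = math.ceil(2 * log_term / delta ** 2)
    deadline = n_k * arm_count
    for m in range(T):
        # (1) Sample the arm with the highest modified upper bound.
        def upper(i):
            if counts[i] == 0:
                return math.inf          # unsampled arms are tried first
            r_i = T / counts[i]
            return means[i] + math.sqrt(log_term * r_i / (2 * n_k))
        best = max(range(K), key=upper)
        reward = arms[best]()
        counts[best] += 1
        means[best] += (reward - means[best]) / counts[best]
        # (2) and (3): at the deadline, recount candidates and halve delta.
        if m + 1 >= deadline:
            radius = math.sqrt(log_term / (2 * n_k))
            w_max = max(means)
            survivors = sum(1 for w in means if w + radius >= w_max - radius)
            arm_count = max(1, survivors)            # assumption: never drop to zero
            if T * (delta / 2) ** 2 >= math.e:       # assumption: keep log(T delta^2) >= 1
                delta /= 2
                log_term = math.log(T * delta ** 2)
                n_k = math.ceil(2 * log_term / delta ** 2)
            deadline = (m + 1) + n_k * arm_count
    return counts, means
```

In this sketch the sampling is nearly uniform while $\Delta$ is large, because the exploration term dominates the differences between average rewards, and it concentrates on the best arm as $\Delta$ is halved.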


2.3 Modified Improved UCB applied to Trees (Mi-UCT) 

We will now introduce the application of the modified improved UCB algorithm
to Monte-Carlo Tree Search, or the Mi-UCT algorithm. The details of the
Mi-UCT algorithm are shown in Algorithm 3.

The Mi-UCT algorithm adopts the same game tree expansion paradigm as
the UCT algorithm, that is, the game tree is expanded over a number of
iterations, and each iteration consists of four steps: selection, expansion, simulation,
and backpropagation [3]. The difference is that the tree policy is replaced by the
modified improved UCB algorithm. The modified improved UCB on each node
is run in an episodic manner; a total of $T_0 = 2$ plays is given to the algorithm in the
initial episode, and $T_{\ell+1} = T_\ell^2$ plays in the subsequent episodes.

The Mi-UCT algorithm keeps track of when $N.\Delta$ should be updated and
the starting point of a new episode by using the variables $N.deltaUpdate$ and
$N.T$, respectively. When the number of playouts $N.t$ of the node $N$ reaches the
updating deadline $N.deltaUpdate$, the algorithm halves the current estimated
regret $N.\Delta$ and calculates the next deadline for halving $N.\Delta$. The variable $N.T$
marks the starting point of a new episode. Hence, when $N.t$ reaches $N.T$, the
related variables $N.\Delta$ and $N.armCount$ are re-initialized, and the starting point
$N.T$ of the next episode, along with the new $N.deltaUpdate$, are calculated.
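
The per-node bookkeeping described above can be sketched in Python as follows. This is a single-agent toy illustration, not the authors' program: `children_fn` and `rollout_fn` are hypothetical callbacks supplied by the caller, `N.T` is used where the pseudocode refers to `N.episodeUpdate`, a child that is simulated for the first time also has its own statistics updated, terminal states are simply evaluated, the arm count is never allowed to drop below one, and $\Delta$ is only halved while $\log(N.T \times N.\Delta^2)$ stays positive.

```python
import math

class Node:
    def __init__(self, state):
        self.state, self.children = state, None   # children is None until expanded
        self.w, self.t = 0.0, 0                   # average reward, visit count
        self.T, self.delta, self.k = 2, 1.0, 1    # episode end, estimated regret, plays per arm
        self.arm_count, self.delta_update = 0, 0

def plays_per_arm(T, delta):
    return math.ceil(2 * math.log(T * delta ** 2) / delta ** 2)

def expand(node, children_fn):                    # NodeExpansion in the pseudocode
    node.children = [Node(s) for s in children_fn(node.state)]
    node.delta, node.T = 1.0, 2
    node.arm_count = len(node.children)
    node.k = plays_per_arm(node.T, node.delta)
    node.delta_update = node.k * node.arm_count

def record(node, result):
    node.w = (node.w * node.t + result) / (node.t + 1)
    node.t += 1

def mi_uct(node, children_fn, rollout_fn):
    if node.children is None:
        expand(node, children_fn)
    if not node.children:                         # terminal state (assumption): evaluate it
        result = rollout_fn(node.state)
        record(node, result)
        return result
    log_term = math.log(node.T * node.delta ** 2)

    def ucb(child):
        if child.t == 0:
            return math.inf
        r = node.T / child.t                      # assumption: N.T stands in for N.episodeUpdate
        return child.w + math.sqrt(log_term * r / (2 * node.k))

    best = max(node.children, key=ucb)
    if best.t == 0:
        result = rollout_fn(best.state)
        record(best, result)                      # assumption: simulated child is updated too
    else:
        result = mi_uct(best, children_fn, rollout_fn)
    record(node, result)

    if node.t >= node.T:                          # a new episode starts
        node.delta = 1.0
        node.T = node.t + node.T * node.T
        node.arm_count = len(node.children)
        node.k = plays_per_arm(node.T, node.delta)
        node.delta_update = node.t + node.k * node.arm_count
    if node.t >= node.delta_update:               # deadline for halving delta
        radius = math.sqrt(math.log(node.T * node.delta ** 2) / (2 * node.k))
        dropped = sum(1 for c in node.children if c.w + radius < node.w - radius)
        node.arm_count = max(1, node.arm_count - dropped)
        if node.T * (node.delta / 2) ** 2 > 1.0:  # assumption: keep the logarithm positive
            node.delta /= 2
        node.k = plays_per_arm(node.T, node.delta)
        node.delta_update = node.t + node.k * node.arm_count
    return result
```

A two-player program would additionally need to flip the sign or perspective of `result` between levels; that step is omitted here because the pseudocode does not show it.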

3 Experimental Results 

We will first examine how the various modifications we have made to the
improved UCB algorithm affect its performance on the multi-armed bandit
problem. Next, we will demonstrate the performance of the Mi-UCT algorithm against
the plain UCT algorithm on the games of 9 x 9 Go and 9 x 9 NoGo.

3.1 Performance on Multi-armed Bandits Problem 

The experimental settings follow the multi-armed bandit testbed that is specified
in [5]. The results are averaged over 2000 randomly generated $K$-armed bandit
tasks. We have set $K = 60$ to simulate more closely the conditions that
bandit algorithms face when they are applied in MCTS for games that
have a medium-to-high branching factor. The reward distribution of each bandit is
a normal (Gaussian) distribution with mean $w_i$, $i \in \{1, \ldots, K\}$, and variance 1. The
mean $w_i$ of each bandit of every generated $K$-armed bandit task was randomly
selected according to a normal distribution with mean 0 and variance 1.
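
A testbed of this kind can be sketched in a few lines of Python. The function names `make_task` and `evaluate`, and the convention that a policy is a callable receiving the history of (arm, reward) pairs, are assumptions made for this illustration only.

```python
import random

def make_task(K=60, rng=None):
    """Draw the K arm means from a normal distribution with mean 0, variance 1."""
    rng = rng or random.Random()
    return [rng.gauss(0.0, 1.0) for _ in range(K)]

def evaluate(policy, means, plays, rng=None):
    """Run `policy` on one task; return the cumulative regret curve and,
    for each play, whether the optimal arm was chosen."""
    rng = rng or random.Random()
    best = max(range(len(means)), key=lambda i: means[i])
    history, regret = [], 0.0
    regret_curve, optimal = [], []
    for _ in range(plays):
        arm = policy(history)
        reward = rng.gauss(means[arm], 1.0)   # unit-variance Gaussian reward
        history.append((arm, reward))
        regret += means[best] - reward        # r* minus the received reward
        regret_curve.append(regret)
        optimal.append(arm == best)
    return regret_curve, optimal
```

Averaging `regret_curve` and `optimal` over many generated tasks (2000 in the experiments reported here) gives curves of the kind plotted in Figure 1 and Figure 2.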

The cumulative regret and optimal action percentage are shown in Figure 1
and Figure 2, respectively. The various results correspond to different algorithms
as follows:

- UCB: the UCB algorithm.

- I-UCB: the improved UCB algorithm.

- I-UCB (episodic): the improved UCB algorithm run episodically.

- Modified I-UCB (no r): only the algorithmic modifications to the improved
  UCB algorithm.

- Modified I-UCB (no r, episodic): only the algorithmic modifications to the
  improved UCB algorithm, run episodically.

- Modified I-UCB: both the algorithmic and the confidence bound modifications to
  the improved UCB algorithm.

- Modified I-UCB (episodic): both the algorithmic and the confidence bound
  modifications to the improved UCB algorithm, run episodically.

Contrary to the theoretical analysis, we were surprised to observe that the original
improved UCB, both I-UCB and I-UCB (episodic), produced the worst cumulative
regret. However, their optimal action percentages are increasing at a very rapid
rate, and are likely to overtake the UCB algorithm if more plays are given. This
suggests that the improved UCB algorithm does indeed devote more plays to
exploration in the early stages.

The "slack" in the curves of the algorithms that were run episodically marks the
points where a new episode begins. Since the confidence bounds are essentially
re-initialized after every episode, extra exploration is effectively performed.
This incurs an extra penalty on performance, which can be clearly
observed in the cumulative regret.


Fig. 1. Cumulative Regret of Various Modifications on Improved UCB Algorithm (horizontal axis: Plays)


We can further see that by making only the algorithmic modifications, which
gives Modified I-UCB (no r) and Modified I-UCB (no r, episodic), the optimal
action percentage increases very rapidly, but it eventually plateaus and the
algorithm remains stuck on suboptimal arms. The cumulative regret also increases
linearly instead of logarithmically.

However, by adding the factor $r_i$ to the confidence bound, the optimal action
percentage increases rapidly and might even overtake the UCB algorithm if more
plays are given. Although the optimal action percentages of the modified improved
UCB, both Modified I-UCB and Modified I-UCB (episodic), are rapidly catching
up with that of the UCB algorithm, there is still a significant gap between their
cumulative regrets.

Fig. 2. Optimal Arm Percentage of Various Modifications on Improved UCB Algorithm

3.2 Performance of Mi-UCT against Plain UCT on 9 x 9 Go

We will demonstrate the performance of the Mi-UCT algorithm against the plain 
UCT algorithm on the game of Go played on a 9 x 9 board. 

For an effective comparison of the two algorithms, no performance enhancing 
heuristics were applied. The simulations are all pure random simulations without 
any patterns or simulation policies. A total of 1000 games were played for each 
constant C setting of the UCT algorithm, each taking turns to play Black. The 
total number of playouts was fixed to 1000, 3000, and 5000 for both algorithms. 

The results are shown in Table 1. It can be observed that the performance of
the Mi-UCT algorithm is quite stable against various constant C settings of the
plain UCT algorithm, and is roughly on the same level. The Mi-UCT algorithm
seems to have better performance when only 1000 playouts are given, but its
advantage slightly deteriorates when more playouts are available.


Table 1. Win rate of Mi-UCT against plain UCT on 9 x 9 Go

                 UCT constant C
Playouts   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
1000       57.1%  55.2%  57.5%  52.2%  58.6%  58.4%  55.8%  55.3%  54.5%
3000       50.8%  50.9%  50.3%  52.2%  52.2%  54.4%  56.5%  56.0%  54.1%
5000       54.3%  54.2%  52.4%  51.0%  52.4%  57.5%  54.9%  56.1%  55.3%
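
Since each entry is based on 1000 games, it can help to keep the sampling error in mind when reading the table. The sketch below computes an approximate 95% interval for a win rate using the normal approximation to the binomial distribution; this interval is not part of the reported experiments and the function name is hypothetical.

```python
import math

def win_rate_interval(win_rate, games=1000, z=1.96):
    """Approximate confidence interval for a win rate estimated from `games` games
    (normal approximation; z = 1.96 corresponds to roughly 95%)."""
    se = math.sqrt(win_rate * (1 - win_rate) / games)
    return win_rate - z * se, win_rate + z * se

print(win_rate_interval(0.571))   # interval lies entirely above 0.5
print(win_rate_interval(0.508))   # interval contains 0.5
```

Under this approximation the half-width is about three percentage points, so entries close to 50% are consistent with the two algorithms being on the same level.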


3.3 Performance of Mi-UCT against Plain UCT on 9 x 9 NoGo

We will demonstrate the performance of the Mi-UCT algorithm against the
plain UCT algorithm on the game of NoGo played on a 9 x 9 board. NoGo is
a misere version of the game of Go, in which the first player that has no legal
moves other than capturing the opponent's stone loses.

All the simulations are pure random simulations, and no extra heuristics
or simulation policies were applied. A total of 1000 games were played for each
constant C setting of the UCT algorithm, each taking turns to play Black. The
total number of playouts was fixed to 1000, 3000, and 5000 for both algorithms.

The results are shown in Table 2. We can observe that the Mi-UCT algorithm
clearly outperforms the plain UCT algorithm when only 1000 playouts are
given, and that its advantage deteriorates rapidly when more playouts are
available, although it is still roughly on the same level as the plain UCT algorithm.

The results on both 9 x 9 Go and 9 x 9 NoGo suggest that the performance
of the Mi-UCT algorithm is comparable to that of the plain UCT algorithm, but
its scalability seems poorer. Since the proposed modified improved UCB algorithm
essentially estimates the expected reward of each bandit by $w_i + \frac{\Delta_m}{2} \sqrt{r_i}$, where
$r_i = T / t_i$, the exploration term converges more slowly than that of the UCB algorithm,
and hence more exploration might be needed for the modified improved UCB
confidence bounds to converge to a "good-enough" estimate; this might
be the reason why the Mi-UCT algorithm has poor scalability. Therefore, we might
be able to overcome this problem by trying other definitions for $r_i$.


Table 2. Win rate of Mi-UCT against plain UCT on 9 x 9 NoGo

                 UCT constant C
Playouts   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
1000       58.5%  56.1%  61.4%  56.7%  57.4%  58.4%  59.6%  56.9%  57.8%
3000       50.3%  51.4%  53.1%  51.0%  49.6%  54.4%  56.0%  54.2%  53.9%
5000       45.8%  48.8%  48.5%  49.6%  55.1%  51.3%  51.3%  55.0%  52.7%


4 Conclusion 

The improved UCB algorithm is a modification of the UCB algorithm, and has 
a better regret upper bound than the UCB algorithm. Various characteristics 
of the improved UCB algorithm, such as early exploration and not being an 
anytime algorithm, are not ideal for a direct application to MCTS. Therefore, 
we have made some modifications to the improved UCB algorithm, making it 
more suitable for the task of game tree search. We have investigated the impact 
and implications of each modification through an empirical study under the 
conventional multi-armed bandit problem setting. 

The Mi-UCT algorithm is the application of the modified improved UCB
algorithm to Monte-Carlo Tree Search. We have demonstrated that it
outperforms the plain UCT algorithm on both 9 x 9 Go and 9 x 9
NoGo when only a small number of playouts are given, and performs on a
comparable level with increased playouts. One possible way of improving the
scalability would be to try other definitions of $r_i$ in the modified improved
UCB confidence bounds.

It would also be interesting to investigate the possibility of enhancing the
performance of the Mi-UCT algorithm by combining it with commonly used
heuristics [6] or by developing new heuristics that are unique to the Mi-UCT algorithm.
Finally, since the modifications we have made essentially changed the behaviour
of the original algorithm, investigation into the theoretical properties of our
modified improved UCB algorithm may provide further insight into the relation
between bandit algorithms and Monte-Carlo Tree Search.